Tacotron 2

arXiv: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.ap-northeast-1.amazonaws.com%2F0%2F3121510%2F0bae893b-2454-1bfe-3155-9f918c64d7c0.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=b50ace47afb9c26eda68ee4d5da621a3

Implementation (unofficial repo by NVIDIA), Tacotron 2 (without WaveNet): https://github.com/NVIDIA/tacotron2

Tacotron 2, a neural network architecture for speech synthesis directly from text.

The system is composed of a recurrent sequence-to-sequence feature prediction network

that maps character embeddings to mel-scale spectrograms,

followed by a modified WaveNet model

acting as a vocoder to synthesize time-domain waveforms from those spectrograms.

Our model achieves a mean opinion score (MOS) of 4.53 comparable to a MOS of 4.58 for professionally recorded speech.
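To make "mel-scale spectrograms" in the abstract a little more concrete, here is a minimal sketch of how filter center frequencies can be spaced evenly on the mel scale. It uses the common HTK-style formula, which is only one of several mel variants. The values 80 channels and 125 Hz to 7.6 kHz are what I recall from the paper, not something stated in these notes, so treat them as assumptions. The function names are made up for illustration.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to mel (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_mels: int, f_min: float, f_max: float) -> list[float]:
    """Center frequencies (Hz) of n_mels filters equally spaced on the mel scale."""
    mel_min, mel_max = hz_to_mel(f_min), hz_to_mel(f_max)
    # n_mels + 2 equally spaced mel points: the two outermost are the band edges,
    # the n_mels inner points are the filter centers.
    step = (mel_max - mel_min) / (n_mels + 1)
    return [mel_to_hz(mel_min + step * i) for i in range(1, n_mels + 1)]


# Assumed settings (recalled from the paper, not from these notes).
centers = mel_center_frequencies(n_mels=80, f_min=125.0, f_max=7600.0)
print(len(centers), round(centers[0], 1), round(centers[-1], 1))
```

The centers are dense at low frequencies and sparse at high frequencies, which is the point of the mel scale: it gives more resolution where human hearing is more sensitive.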

The notes below are probably wrong; I will fix them once I have read the paper properly.

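The base architecture for TTS models from a little while ago?

Uses WaveNet as the vocoder

Nowadays it would probably be better to swap it for WaveGlow

Is this what improved the score over the original Tacotron?

The conditioning input was changed to mel spectrograms

Originally it was linguistic, duration, and F0 features. (I don't really understand these)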

MOS(Mean Opinion Score): 4.53

For comparison, professionally recorded speech has MOS: 4.58
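As a reminder of what a MOS number is, here is a minimal sketch: listeners rate each sample on a fixed scale and the MOS is the mean of those ratings, usually reported with a confidence interval. The ratings below are made up for illustration, and the 1 to 5 scale with 0.5 steps and the normal-approximation interval (z = 1.96) are my assumptions, not details taken from these notes.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mean rating, half-width of an approximate 95% confidence interval)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mean = statistics.fmean(ratings)
    # Standard error of the mean; 1.96 is the normal-approximation z value for 95%.
    sem = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, 1.96 * sem


# Hypothetical listener ratings on an assumed 1-5 scale with 0.5 steps.
ratings = [5, 4, 5, 4.5, 4, 5, 4.5, 5, 4, 4.5]
mos, ci = mean_opinion_score(ratings)
print(f"MOS: {mos:.2f} +/- {ci:.2f}")
```

With only a handful of ratings the interval is wide, so a gap as small as 4.53 vs 4.58 only means something when many ratings are collected.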

FYI

An attempt at explaining the Tacotron 2 implementation: https://qiita.com/atsushi11o7/items/ea659641c354b4001fe4

Vaibhavs10/open-tts-tracker: https://github.com/Vaibhavs10/open-tts-tracker